<img src="http://nci.org.au/wp-content/themes/nci/img/img-logo-large.png" width=400>

Point data manipulation with PDAL

This notebook is an incomplete training document for NCI systems. All pipelines and commands used here have been tested against PDAL 1.3 on a CentOS system, although you will need to modify paths and file names to suit your system.

PDAL is not yet installed as a production system at NCI. We expect it will be available via NCI's virtual desktop infrastructure for NCI users in early 2017. Keep an eye on http://vdi.nci.org.au/news for details.

In this notebook:

  • manipulating point data on the NCI filesystem using PDAL (http://pdal.io, v1.3)

The following material uses Geoscience Australia's Elevation Data Collection, which is available under a Creative Commons Attribution 4.0 licence through NCI's THREDDS Data Server. For more information on the collection and licensing, see the catalogue links under 'Locate data' below.

What is PDAL?

http://pdal.io

PDAL (the Point Data Abstraction Library) is an open source library for handling massive point data sets. Its name is derived from GDAL, since it aims to fill the same role for point data.

PDAL is actually a C++ library - if you're writing applications you can link it into your code. It also has Python bindings. Today we'll explore some of PDAL's capabilities using its command line applications, which are mostly wrappers around PDAL's pipeline functions.

We'll also use a sneaky bit of libLAS: http://www.liblas.org

...but you'll hopefully see why we'd prefer PDAL in the end.

Why would I use PDAL? Why is it here on data day?

If you deal with:

  • LiDAR elevation observations
  • 3D photogrammetry
  • Any other data which consists of dense, ungridded points

...PDAL can make your life a lot simpler with some basic processing tasks.

PDAL is demonstrated at this workshop because point data exist on NCI's filesystem, and the VDI is the perfect place to work on them without having to ship unnecessary data. Using the VDI as a front end for development is an excellent way to prototype point data analyses on data that exist on the NCI filesystem, or to define specific subsets to take away for analysis.

Agenda for this workbook

A lightning speed overview of point data handling and manipulation:

  1. Getting information about a point dataset
  2. Collecting a subset from a LiDAR survey
  3. Requesting only a specific point class from a dataset
  4. Classifying ground (in case you don't like the vendor's version of 'ground')
  5. Generating a bare earth DEM and a DSM
  6. Requesting 'height above ground' instead of 'absolute height'

We will do all this on the command line, viewing results in CloudCompare or QGIS. These tasks are based on the PDAL workshop here: http://www.pdal.io/workshop/index.html, and are very much 'learn by doing'. PDAL is very well documented - please keep reading there for more information.

...so feel free to zoom ahead, create and share!

Set up

module purge
module load QGIS GDAL PDAL cloudcompare

To verify the modules loaded, try the quick check below.
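A minimal sanity check (the --version flag is assumed to be available in your PDAL build; pdal --help also works):

which pdal          # confirm the PDAL binary is on your PATH
pdal --version      # report the PDAL version string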

A canned QGIS project with all the layers shown in this workbook is here:

/path/to/materials/pdal_demo.qgis

Feel free to explore it, or build your own example and apply the tools you see here.

Locate data

We will use a LiDAR survey flown over Merimbula in 2013, from the Geoscience Australia elevation reference collection. Here is its catalogue entry:

THREDDS

http://dapds00.nci.org.au/thredds/catalog/rr1/Elevation/Merimbula0313/catalog.html

Geonetwork

https://geonetwork.nci.org.au/geonetwork/srv/eng/catalog.search#/metadata/f9082_1236_9859_8989

Licensed under Creative Commons (CCBY4): http://dapds00.nci.org.au/thredds/fileServer/licenses/rr1_licence.pdf

The path to the data on your VDI desktop is:

/g/data1/Elevation/Merimbula0313/

...and LAS tiles are in:

/g/data1/Elevation/Merimbula0313/Tiles_2k_2k

1. Basic information

Try:

pdal info /path/to/Elevation/Merimbula0313/z55/2013/Mass_Points/LAS/AHD/LAS/Tiles_2k_2k/Merimbula2013-C3-AHD_7605910_55_0002_0002.las

...and compare with:

lasinfo /path/to/Elevation/Merimbula0313/z55/2013/Mass_Points/LAS/AHD/LAS/Tiles_2k_2k/Merimbula2013-C3-AHD_7605910_55_0002_0002.las


lasinfo gives more compact results - but it can only read LAS. PDAL's info application can describe dimensions in any format it has a reader for: http://www.pdal.io/stages/readers.html - which also hints that PDAL can process point data in a diverse range of formats.
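pdal info can also focus on just the parts you need. For example (flag availability can vary between PDAL versions - check pdal info --help):

pdal info --metadata <file.las>    # header and coordinate system information only
pdal info --stats <file.las>       # per-dimension statistics (minimum, maximum, mean...)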

2. Clipping point data with PDAL

Straight into the fire! We're going straight to PDAL's pipeline architecture, which gives it an enormous amount of flexibility and power. A pipeline is a set of operations chained together and defined in a JSON file. You'll see it in action here.

a. Why do we want to clip LAS data?

LAS tiles are pretty hard to handle - you get a lot of extra data that you may not want, and they are pretty much always boringly square. If we only need a certain region, we can get just those points using PDAL.

b. An example - selecting Merimbula town

The image here shows a lot of data: an index of LiDAR tiles labelled by filename, an OpenStreetMap layer giving us an idea of where Merimbula town is, and a polygon drawn roughly around the urban area. These are all in the QGIS project given above.

It also shows that Merimbula town covers several LiDAR tiles. This is an excellent challenge - it means some tiles hold a lot of data we don't want, and it means we have to sift through multiple tiles to find the data of interest. The rough polygon surrounding our region of interest was drawn in QGIS and saved as GeoJSON.

Here's the GeoJSON polygon:

{
"type": "FeatureCollection",
"crs": { "type": "name", "properties": { "name": "urn:ogc:def:crs:EPSG::28355" } },
"features": [
{ "type": "Feature", "properties": { "id": 0 }, "geometry": { "type": "Polygon", "coordinates": [ [ [ 759094.480855233967304, 5913008.271593709476292 ], [ 758464.699909413931891, 5912716.199270982295275 ], [ 757743.646362751838751, 5912898.744472593069077 ], [ 757716.26458250079304, 5913304.907546310685575 ], [ 757373.992329337401316, 5913418.998297326266766 ], [ 757018.029186049010605, 5913724.761510098353028 ], [ 757556.537531022448093, 5913784.088700683787465 ], [ 757828.153738587978296, 5913997.946465536952019 ], [ 757828.153738587396219, 5914326.52782854065299 ], [ 758357.534823469701223, 5914381.291389083489776 ], [ 758877.788648267393, 5914554.709330711513758 ], [ 758850.406868015765212, 5914810.272613044828176 ], [ 759042.079329782165587, 5914837.654393311589956 ], [ 759151.606450793216936, 5914673.363711818121374 ], [ 759370.660692813224159, 5914709.872752171941102 ], [ 759361.533432727912441, 5915102.34493575617671 ], [ 760593.713544093072414, 5915138.853976195678115 ], [ 761177.858189482591115, 5915047.581375411711633 ], [ 761123.094628979102708, 5914235.255227984860539 ], [ 761260.003530243760906, 5914007.07372591085732 ], [ 761570.33037310524378, 5913952.31016543880105 ], [ 761369.530651255394332, 5913559.837981833145022 ], [ 761141.349149147979915, 5913459.438120897859335 ], [ 760484.186423085397109, 5913377.292780089192092 ], [ 759817.896436938317493, 5913632.856062367558479 ], [ 759516.696854161447845, 5913550.710721591487527 ], [ 759416.29699323582463, 5913286.020179163664579 ], [ 759094.480855233967304, 5913008.271593709476292 ] ] ] } }
]
}

...but PDAL needs WKT (for now). Using this website: http://rodic.fr/blog/online-conversion-between-geometric-formats/, we can get a WKT polygon:

POLYGON((759094.480855234 5913008.2715937095,758464.6999094139 5912716.199270982,757743.6463627518 5912898.744472593,757716.2645825008 5913304.907546311,757373.9923293374 5913418.998297326,757018.029186049 5913724.761510098,757556.5375310224 5913784.088700684,757828.153738588 5913997.946465537,757828.1537385874 5914326.527828541,758357.5348234697 5914381.2913890835,758877.7886482674 5914554.7093307115,758850.4068680158 5914810.272613045,759042.0793297822 5914837.654393312,759151.6064507932 5914673.363711818,759370.6606928132 5914709.872752172,759361.5334327279 5915102.344935756,760593.7135440931 5915138.853976196,761177.8581894826 5915047.581375412,761123.0946289791 5914235.255227985,761260.0035302438 5914007.073725911,761570.3303731052 5913952.310165439,761369.5306512554 5913559.837981833,761141.349149148 5913459.438120898,760484.1864230854 5913377.292780089,759817.8964369383 5913632.856062368,759516.6968541614 5913550.7107215915,759416.2969932358 5913286.020179164,759094.480855234 5913008.2715937095))
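If you'd rather script the conversion, the GDAL Python bindings (available via the GDAL module loaded above) can do the same job - a minimal sketch, assuming the polygon was saved from QGIS as merimbula_town.geojson:

from osgeo import ogr

# open the GeoJSON and print the first (only) feature's geometry as WKT
ds = ogr.Open("merimbula_town.geojson")
layer = ds.GetLayer()
feature = layer.GetNextFeature()
print(feature.GetGeometryRef().ExportToWkt())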

c. Making a list of LiDAR tiles

We need to know which tiles contain our data - by labelling the tile index layer in QGIS we can pick out the following set:

Merimbula2013-C3-AHD_7565916_55_0002_0002.las
Merimbula2013-C3-AHD_7565914_55_0002_0002.las
Merimbula2013-C3-AHD_7565912_55_0002_0002.las
Merimbula2013-C3-AHD_7585916_55_0002_0002.las
Merimbula2013-C3-AHD_7585914_55_0002_0002.las
Merimbula2013-C3-AHD_7585912_55_0002_0002.las
Merimbula2013-C3-AHD_7605916_55_0002_0002.las
Merimbula2013-C3-AHD_7605914_55_0002_0002.las
Merimbula2013-C3-AHD_7605912_55_0002_0002.las


PDAL supports filename globbing for pipelines - so you could just send "../merimbula2013/Tiles_2k_2k/*.las" to the pipeline below. However, you'll find pretty quickly that building an index with PDAL's *tindex* application, then using that index to find where tiles and polygons intersect (without needing to know tile names), is vastly more efficient. For this demonstration we want to avoid creating extra datasets, so we list our tiles of interest; a tindex sketch follows for reference.
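The tindex workflow looks something like this - the subcommand syntax here is taken from the PDAL documentation and may differ in your version, so check pdal tindex --help (paste the WKT polygon from above in place of POLYGON((...))):

pdal tindex create --tindex ./merimbula_tiles.sqlite --filespec "../merimbula2013/Tiles_2k_2k/*.las" -f "SQLite"
pdal tindex merge --tindex ./merimbula_tiles.sqlite --filespec ./merimbulatown_clip.laz --polygon "POLYGON((...))"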

d. Constructing a PDAL pipeline

We create a JSON file which tells PDAL what to do:

nano merimbula_clip.json

...and paste in the following:

{
    "pipeline": [

        "../merimbula2013/Tiles_2k_2k/Merimbula2013-C3-AHD_7565916_55_0002_0002.las",
        "../merimbula2013/Tiles_2k_2k/Merimbula2013-C3-AHD_7565914_55_0002_0002.las",
        "../merimbula2013/Tiles_2k_2k/Merimbula2013-C3-AHD_7565912_55_0002_0002.las",
        "../merimbula2013/Tiles_2k_2k/Merimbula2013-C3-AHD_7585916_55_0002_0002.las",
        "../merimbula2013/Tiles_2k_2k/Merimbula2013-C3-AHD_7585914_55_0002_0002.las",
        "../merimbula2013/Tiles_2k_2k/Merimbula2013-C3-AHD_7585912_55_0002_0002.las",
        "../merimbula2013/Tiles_2k_2k/Merimbula2013-C3-AHD_7605916_55_0002_0002.las",
        "../merimbula2013/Tiles_2k_2k/Merimbula2013-C3-AHD_7605914_55_0002_0002.las",
        "../merimbula2013/Tiles_2k_2k/Merimbula2013-C3-AHD_7605912_55_0002_0002.las",
        {
            "type": "filters.crop",
            "polygon": "POLYGON((759094.480855234 5913008.2715937095,758464.6999094139 5912716.199270982,757743.6463627518 5912898.744472593,757716.2645825008 5913304.907546311,757373.9923293374 5913418.998297326,757018.029186049 5913724.761510098,757556.5375310224 5913784.088700684,757828.153738588 5913997.946465537,757828.1537385874 5914326.527828541,758357.5348234697 5914381.2913890835,758877.7886482674 5914554.7093307115,758850.4068680158 5914810.272613045,759042.0793297822 5914837.654393312,759151.6064507932 5914673.363711818,759370.6606928132 5914709.872752172,759361.5334327279 5915102.344935756,760593.7135440931 5915138.853976196,761177.8581894826 5915047.581375412,761123.0946289791 5914235.255227985,761260.0035302438 5914007.073725911,761570.3303731052 5913952.310165439,761369.5306512554 5913559.837981833,761141.349149148 5913459.438120898,760484.1864230854 5913377.292780089,759817.8964369383 5913632.856062368,759516.6968541614 5913550.7107215915,759416.2969932358 5913286.020179164,759094.480855234 5913008.2715937095))"
            "outside": false
        },
    "./merimbulatown_clip.las"
    ]
}

e. Apply our clipping operation

Then we execute the task using:

pdal pipeline merimbula_clip.json

Time to execute on an 8-core, 16 GB RAM VM: 1 minute 12 seconds


Can we do better than that?

**Yes!** If we first generate a tile index and give PDAL a better idea of which data to use, we can do this job in **49 seconds**, including compression to LAZ, using PDAL's tindex reader. This still generates a 46 MB LAZ file (388 MB as uncompressed LAS), so a round trip of a couple of minutes would be expected from a web-based request to deliver a data subset to your machine.

Note

PDAL also supports file name globbing in pipelines. However, giving our clip operation an explicit tile list (or a tile index) saves PDAL from reading tiles that never intersect our polygon.

This will result in a set of points inside your polygon being written into a .LAS file at the location specified in the pipeline file. Now you have a template for doing this job with pretty much any LAS tiles!

How much data do we have? pdal info ./merimbulatown_clip.las tells us we have 11,948,596 points - so our processing time is really not bad (read 9 LAS tiles, do some clipping, then merge and write an 11.9-million-point dataset).

So let's explore our new dataset. In a new terminal, type:

cloudcompare &

...and use its File/Open menu to navigate to your newly made LAS file. Take a look at it there (hint - use the projections menu to convert the Z dimension to a scalar field to colour your points by height).

Here's an example. Note that CloudCompare's default point colouring scheme is 'point source', which is useful but not pretty. Use the 'scalar fields' dropdown in the lower left panel to change your colour scheme - the screenshot here uses intensity.


Caution

If your polygon is quite large, or your points very dense, or both, you can still get a massive dataset! Use pdal info to estimate how dense the data are, then multiply by the area you are clipping to estimate the final file size before going ahead.
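A back-of-the-envelope sketch (the density and the ~30 bytes per LAS point record are illustrative assumptions - record size depends on the point format):

density = 10              # points per square metre, from pdal info or a density map
area = 2000 * 2000        # clip polygon area in square metres
bytes_per_point = 30      # rough LAS point record size
print(density * area * bytes_per_point / 1e9)  # ~1.2 GB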

Extension

Colour your points by height.

3. I want only buildings

If you only want a specific classification from your cloud, here's how to filter it out using PDAL. We take advantage of the LAS specification, which tells us that the numerical code for points of class 'building' is 6.

See Table 4.9 in this document: http://www.asprs.org/wp-content/uploads/2010/12/LAS_1-4_R6.pdf for a list of standard classification codes.

Save the following in merimbula_buildings.json:

{
    "pipeline": [
        "./merimbulatown_clip.las",
        {
            "limits": "Classification[6:6]",
            "type": "filters.range"
        },
        "./merimbulabuildings.las"
    ]
}

...and execute:

pdal pipeline merimbula_buildings.json

Time to execute: 6.071 seconds

...and add a new layer in CloudCompare - here the point cloud of buildings we just made is coloured by point source, since it contrasts nicely.


Extension

Can you do this job using las2las? Why would you use PDAL instead?
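For the curious, the libLAS route looks something like this (the --keep-classes flag is from the libLAS documentation - treat it as an assumption and check las2las --help):

las2las --input ./merimbulatown_clip.las --output ./merimbulabuildings_liblas.las --keep-classes 6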

When writing your own data, using ASPRS classification codes to flag buildings, ground, trees etc. is a good start - it helps you integrate with a lot of existing analysis tools.

4. I don't like the vendor's ground classification, and want to try my own

Vendor-supplied ground classifications can be poorly documented, or may not meet your requirements for other reasons. You can use PDAL to construct your own, with more control over how ground is parameterised. PDAL also implements some relatively recent ground classification algorithms.

Here we use the Simple Morphological Filter (Pingel, T.J., Clarke, K.C., McBride, W.A., 2013. An improved simple morphological filter for the terrain classification of airborne LIDAR data. ISPRS J. Photogramm. Remote Sens. 77, 21-30). Save the following as merimbula_smrf.json:

{
  "pipeline":[
    "./merimbulatown_clip.las",
    {
      "type":"filters.smrf",
      "extract": false,
      "cell": 2.0,
      "window": 42.0 
    },
    "./merimbulatown_smrf.laz"
  ]
}
...and run it:

pdal pipeline merimbula_smrf.json

This takes some time - and the results need inspecting. Did these parameters do a better job than the vendor?


These ground filters don't need LiDAR data to work, but they do rely on sufficiently dense input. 3D photogrammetry works equally well, and swath sonar is also a candidate - essentially anything with points separated by a few metres or less. So the next few steps, which rely on ground classification - finding height above ground and making DEMs - would work equally well on any dense point dataset. Be sure to read the docs and tune your ground detection parameters to suit your data.

5. All I want is a better DEM than SRTM

PDAL and GDAL work hand in glove, with PDAL writing rasters that GDAL utilities such as gdaldem can then style and analyse. Here we look at two types of raster elevation model: the DSM and the DTM/DEM.

a. The easy one - a DSM

First a DSM - Digital Surface Model - meaning the top of everything in the point cloud.

This is relatively simple given what we've accomplished already. Here's the pipeline to do it - save it as merimbula_dsm.json:

{
    "pipeline": [
        "./merimbulatown_clip.las",
        {
            "filename":"./merimbula_dsm",
            "output_format":"tif",
            "output_type":"all",
            "grid_dist_x":"2.0",
            "grid_dist_y":"2.0",
            "type": "writers.p2g"
        }
    ]
}

...and by now we know the pattern:

pdal pipeline merimbula_dsm.json

Time to execute: 27.356 seconds

b. The slightly less easy one - the DTM/DEM

For this one we need to ignore everything that is not 'ground' - let's do this without creating a new .las file containing only ground points. In this example we pipe the result of one filter into another process, starting to show how PDAL operations can be chained. In the first block we limit our selection to points of class 2 (the ASPRS LAS code for 'ground'), then pass the result to the raster generator in the second block. Save this as merimbula_dtm.json:

{
    "pipeline": [
        "./merimbulatown_clip.las",
        {
            "limits": "Classification[2:2]",
            "type": "filters.range"
        },
        {
            "filename":"./merimbula_dtm",
            "output_format":"tif",
            "output_type":"all",
            "grid_dist_x":"2.0",
            "grid_dist_y":"2.0",
            "type": "writers.p2g"
        }
    ]
}

...and once more:

pdal pipeline merimbula_dtm.json

Time to execute: 12.34 seconds


Extension

  1. You can see that the raster elevation models in this example are hillshaded - replicate this! How many steps did your method take? (One possible starting point is sketched after this list.)
  2. We made a DTM on a 2 m grid. Make a 1 m DTM and compare the two (do we have enough point density to make a 1 m DTM? Or even a 2 m DTM?)
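For extension 1, gdaldem is one place to start (a sketch only - the input file name is an assumption, since writers.p2g with output_type 'all' writes several statistics rasters; check which files actually appeared):

gdaldem hillshade ./merimbula_dtm.mean.tif ./merimbula_dtm_hillshade.tif -z 1.0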

Quick aside - point density map

Not even a pipeline! PDAL can create a set of hexagonal bins to show the actual point coverage and density. While the metadata for this survey gives an average density for the entire dataset, what does it look like in real life?

First, let's set ourselves a sane sample area, say 20 m² per cell. PDAL lets us set a hexagon edge length - so how long is the edge of a regular hexagon with an area of 20 m²? A regular hexagon of side s has area (3√3/2)s², so s = √(2A/(3√3)):


In [1]:
from math import sqrt

def hexside(area):
    # edge length of a regular hexagon of the given area:
    # area = (3 * sqrt(3) / 2) * side ** 2
    return sqrt(2 * area / (3 * sqrt(3)))

In [2]:
hexside(20)


Out[2]:
2.7745276335252114

OK, apply! This gives us a 20 m² binned SQLite dataset that can be loaded straight into QGIS:

pdal density -i ./merimbulatown_clip.las -o merimbuladensity_20m.sqlite --filters.hexbin.edge_size=2.7745276335252114

Looking at our density map, classified by point count per hexagon, we see that most of the coverage has fewer than 3 points per square metre (cyan), and a reasonable proportion has fewer than 1 point per square metre (blue). The dataset-wide average has been inflated a little by swath overlaps, which reach up to 10 points per square metre!

Based on this information, a 2 m DTM is reasonable for a lot of the data (at least two samples per cell), but a 5 m DTM might be more realistic (at least 5 samples per cell).


Extension

How could we use PDAL to give us a consistent point density across our dataset?
*Hint: check out http://www.pdal.io/stages/filters.dartsample.html*

6. Where is the tallest building in Merimbula?

Sometimes we want relative height above ground (HAG) rather than absolute elevation. You might be interested in the tallest tree, or the mean height of shrubs, or, in this case, finding the tallest building in a region. We now know how to pipe PDAL operations together, so let's do some of that right now. Save the following as merimbula_hag.json:

{
  "pipeline":[
    "merimbulatown_clip.las",
    {
      "type":"filters.height"
    },
    {
      "type":"filters.ferry",
      "dimensions":"HeightAboveGround = Z",
    },
    "merimbulatown_hag.las"
  ]
}
...and run:

pdal pipeline merimbula_hag.json

Time to execute: 35 seconds

...gets us height above ground! Here's a profile view of the data. The new height-above-ground points are coloured, original data in grey (translated 100 m upward).

But where is the tallest building? One approach is to make a dataset of building heights and consult pdal info for some guidance.

nano find_tallest.json
{
  "pipeline":[
    "merimbulatown_hag.las",
    {
            "limits": "Classification[6:6]",
            "type": "filters.range"
    },
    "merimbulabuildings_hag.las"
  ]
}

This is our pipeline for selecting a single class. Run it, then use pdal info to find out how tall the tallest building is; that value guides the next version of the pipeline, which adds a height filter and writes a comma-delimited text file.
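Concretely (pdal info --stats is standard; remember that after the ferry step above, Z now stores height above ground):

pdal pipeline find_tallest.json
pdal info --stats ./merimbulabuildings_hag.las

Note the maximum Z value reported, then edit find_tallest.json again to add the height filter: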

{
    "pipeline": [
        "./merimbulatown_hag.las",
        {
            "limits": "Classification[6:6]",
            "type": "filters.range"
        },
        {
            "limits": "Z[23.5:25]",
            "type": "filters.range"
        },
        "./merimbulabuildings_hag_tallest.txt"
    ]
}

...and run it once more:

pdal pipeline find_tallest.json

...which we can inspect in our QGIS project. Load the CSV file up and see where Merimbula's tallest 'building' is - styled as a red dot in the map below. Do you think these points are really a building?


Extension

  1. What's the address of the building?
  2. How can I quickly assess whether the points identified are really a building?

This is a pretty clunky way of finding the highest point in a dataset. An alternative might be writing a custom filter - or using PDAL's Python bindings to ingest data into a numpy array. For the second approach, just be mindful of memory consumption.

***Note:*** using the PDAL-Python approach, you could achieve most of the outcomes here with array operations - the main limitation being that you need to ingest all the data into memory at once (on the command line, PDAL streams data, keeping only chunks in memory at a time).
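To give a flavour of that approach, here is a minimal sketch (it assumes the pdal Python package is installed, which is not part of the NCI setup above; the Pipeline API is as documented at pdal.io):

import pdal

pipeline_json = """
{
  "pipeline": ["./merimbulabuildings_hag.las"]
}
"""

pipeline = pdal.Pipeline(pipeline_json)
pipeline.execute()             # reads every point into memory
points = pipeline.arrays[0]    # a numpy structured array, one field per dimension
print(points["Z"].max())       # tallest height above ground, in one line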

7. Awesome! But my points are not .LAS tiles

PDAL has a wide range of readers built in - including a configurable text reader. Because it is built with GDAL in mind, it will generally ingest GDAL-readable formats.

Let's try reading a gridded NetCDF file as a point cloud:

{
  "pipeline":[
    {
      "type":"readers.gdal",
      "filename":"./IR_gravity_anomaly_Australia_V1.nc"
    },
    {
      "type":"filters.ferry"
      "dimensions":"Latitude=Y, Longitude=X, grav_ir_anomaly=Z",
    },
    {
      "type":"writers.laz",
      "filename":"./gravityanomaly.laz"
    }
  ]
}

Did this work? No! Why not? Because we have not (yet) built a schema to read out the gridded gravity anomaly!

Why include a broken example? To inspire you! What could we do if this function worked?

We're also waiting to see if we can get hold of some experimental geophysics point data... depending on how they're packed, this might work a lot more easily! Then the open-ended question is 'why use PDAL and not native netCDF IO?'

Summary

Points are no longer your enemy! Using a completely open source stack and the VDI, you can whip point data into shape quickly and easily. You don't need to copy the raw tiles anywhere, only your derived datasets.

Still, keep an eye on those - they can get quite large.

Just to reiterate: PDAL and CloudCompare will happily handle data from 3D photogrammetry, terrestrial scanners, mobile (non-airborne) scanners, Zebedee handheld scanners, sonars, radars... whatever produces data arranged as points in space.

Future things

This workbook only goes through simple pipelines - a more complex example might clip, classify, and compute height above ground in one step.
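Such a pipeline might look like the sketch below (a sketch only - paste in the WKT polygon from section 2, and expect to tune the smrf parameters for your data):

{
  "pipeline": [
    "../merimbula2013/Tiles_2k_2k/Merimbula2013-C3-AHD_7605914_55_0002_0002.las",
    {
      "type": "filters.crop",
      "polygon": "POLYGON((...))"
    },
    {
      "type": "filters.smrf",
      "extract": false
    },
    {
      "type": "filters.height"
    },
    "./merimbula_clip_ground_hag.laz"
  ]
}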

PDAL also underpins processing for visualisations like web-based massive point cloud rendering - we've used PDAL to apply colour from aerial imagery to LiDAR points prior to indexing for interactive web display.

PDAL's Python bindings are also on the way here - meaning that point data can be piped straight into numpy arrays. The disadvantage of this approach is that all the points need to be held in memory at once, so the subsetting demonstrated here will be a critical first step.

Caveats

  • PDAL does not yet handle full waveform LiDAR data. We're looking at PyLIDAR as a package for handling full waveform data, alongside other methods for handling points + pulses.
  • PDAL is not parallel by nature (yet). Processing multiple tiles can be done using a parallelisation wrapper (e.g. GNU Parallel), but tasks need to be cut to fit. We're not ready for massively parallel jobs with PDAL just yet!

Where to get help?

...ask us! But we don't know everything - so we might point you to the PDAL documentation (http://www.pdal.io) or the PDAL issue tracker on GitHub.

...but PDAL doesn't do my specialised thing X, why not?

Because you didn't build it yet! Fork PDAL and have your merry way with it: https://github.com/PDAL/PDAL

Is PDAL ready to use for analysis, given that this workbook identifies some unresolved issues?

Yes! PDAL is actively maintained and developed, with a lot of interest in the community to keep its momentum up. The glitches we see all have workarounds - and their presence on the list of issues means they are being attended to.